Learning Space-Time Semantic Correspondences
We propose a new task of space-time semantic correspondence prediction in
videos. Given a source video, a target video, and a set of space-time
keypoints in the source video, the task requires predicting a set of keypoints
in the target video that are the semantic correspondences of the provided
source keypoints. We believe that this task is important for fine-grained video
understanding, potentially enabling applications such as activity coaching,
sports analysis, robot imitation learning, and more. Our contributions in this
paper are: (i) proposing a new task and providing annotations for space-time
semantic correspondences on two existing benchmarks: Penn Action and Pouring;
and (ii) presenting a comprehensive set of baselines and experiments to gain
insights about the new problem. Our main finding is that the space-time
semantic correspondence prediction problem is best approached jointly in space
and time rather than through its decomposed sub-problems: temporal alignment
and spatial correspondence.
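
A minimal sketch of the task interface as described in the abstract; the type and function names below are illustrative assumptions, not taken from the paper:

from dataclasses import dataclass
from typing import List

@dataclass
class SpaceTimeKeypoint:
    t: int      # frame index (the time coordinate)
    x: float    # spatial position within the frame
    y: float

def predict_correspondences(source_video, target_video,
                            source_keypoints: List[SpaceTimeKeypoint]
                            ) -> List[SpaceTimeKeypoint]:
    # For each source keypoint, return the space-time point in the target
    # video that is its semantic correspondence. Per the paper's main
    # finding, a model would predict (t, x, y) jointly rather than first
    # aligning time and then matching spatially.
    raise NotImplementedError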
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body
keypoints in complex, multi-person video. We propose an extremely lightweight
yet highly effective approach that builds upon the latest advancements in human
detection and video understanding. Our method operates in two stages: keypoint
estimation in frames or short clips, followed by lightweight tracking to
generate keypoint predictions linked over the entire video. For frame-level
pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D
extension of this model, which leverages temporal information over small clips
to generate more robust frame predictions. We conduct extensive ablative
experiments on the newly released multi-person video pose estimation benchmark,
PoseTrack, to validate various design choices of our model. Our approach
achieves an accuracy of 55.2% on the validation set and 51.8% on the test set
using the Multi-Object Tracking Accuracy (MOTA) metric, and achieves
state-of-the-art performance on the ICCV 2017 PoseTrack keypoint tracking
challenge.
Comment: In CVPR 2018. Ranked first in the ICCV 2017 PoseTrack challenge
(keypoint tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack
and webpage: https://rohitgirdhar.github.io/DetectAndTrack
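
The second stage (lightweight tracking) can be pictured as bipartite matching between detections in consecutive frames. A hedged sketch using mean keypoint distance as the link cost; the paper considers several similarity measures, and every name and threshold here is an illustrative assumption:

import numpy as np
from scipy.optimize import linear_sum_assignment

def link_tracks(poses_per_frame, max_dist=50.0):
    # poses_per_frame: list over frames; each entry is an array of shape
    # (num_people, num_keypoints, 2). Returns one track id per detection.
    next_id, track_ids = 0, []
    prev_poses, prev_ids = None, None
    for poses in poses_per_frame:
        ids = [-1] * len(poses)
        if prev_poses is not None and len(prev_poses) and len(poses):
            # Cost = mean L2 distance between corresponding keypoints.
            cost = np.array([[np.linalg.norm(p - q, axis=-1).mean()
                              for q in prev_poses] for p in poses])
            rows, cols = linear_sum_assignment(cost)
            for r, c in zip(rows, cols):
                if cost[r, c] < max_dist:   # only link plausible matches
                    ids[r] = prev_ids[c]
        for i in range(len(ids)):
            if ids[i] == -1:                # unmatched: start a new track
                ids[i] = next_id
                next_id += 1
        track_ids.append(ids)
        prev_poses, prev_ids = poses, ids
    return track_ids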
MEXSVMs: Mid-level Features for Scalable Action Recognition
This paper introduces MEXSVMs, a mid-level representation enabling efficient recognition of actions in videos. The entries in our descriptor are the outputs of several movement classifiers evaluated over spatio-temporal volumes of the image sequence, using space-time interest points as low-level features. Each movement classifier is a simple exemplar-SVM, i.e., an SVM trained using a single positive video and a large number of negative sequences. Our representation offers two main advantages. First, since our mid-level features are learned from individual video exemplars, they require a minimal amount of supervision. Second, we show that even simple linear classification models trained on our global video descriptor yield action recognition accuracy comparable to the state of the art. Because of the simplicity of linear models, our descriptor makes it efficient to learn classifiers for a large number of different actions and to recognize actions even in large video databases. Experiments on two of the most challenging action recognition benchmarks demonstrate that our approach achieves accuracy similar to the best known methods while performing 70 times faster than the closest competitor.
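
A minimal sketch of the descriptor construction, assuming precomputed low-level features per spatio-temporal volume and max-pooling of classifier responses; the pooling choice and all names are assumptions for illustration, not taken from the paper:

import numpy as np
from sklearn.svm import LinearSVC

def mexsvm_descriptor(volume_features, exemplar_svms):
    # volume_features: (num_volumes, feat_dim) low-level features pooled
    # from space-time interest points, one row per spatio-temporal volume.
    # exemplar_svms: list of (w, b) linear classifiers, each trained on a
    # single positive video and a large set of negatives.
    # Each descriptor entry is the strongest response of one exemplar-SVM
    # over all volumes of the video.
    return np.array([np.max(volume_features @ w + b)
                     for w, b in exemplar_svms])

# Recognition then reduces to a simple linear model over these descriptors:
# clf = LinearSVC().fit(train_descriptors, train_labels)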